Analyse SAT and ACT scores from previous years to provide recommendations on where money can be spent to increase SAT participation rates.
## Executive Summary

### Contents:
- [Import Libraries](#Import-Libraries)
- [Data Import and Cleaning - 2017](#Data-Import-and-Cleaning:-2017)
- [Data Dictionary](#Data-Dictionary)
- [Data Import and Cleaning: 2018 and 2019](#Data-Import-and-Cleaning:-2018-and-2019)
- [Exploratory Data Analysis](#Exploratory-Data-Analysis)
- [First Impressions: SAT Participation Rate Is Getting Better!](#First-Impressions:-SAT-Participation-Rate-Is-Getting-Better!)
- [Visualizations](#Visualization)
- [Participation Rate: Box Plots tells it all!](#Participation-Rate:-Box-Plots-tells-it-all!)
- [My Choice, top 3 states](#My-Choice,-top-3-states)
- [Conclusions](#Conclusions)
- [Recommendations](#Recommendations)
- [What Next!](#What-Next!)

**If you combine your problem statement, executive summary, data dictionary, and conclusions/recommendations, you have an amazing README.md file that quickly aligns your audience to the contents of your project.** Don't forget to cite your data sources!
*All libraries used should be added here*
```python
## Importing all required libraries
import numpy as np
import pandas as pd
import scipy.stats as stats
import seaborn as sns
import matplotlib.pyplot as plt

sns.set_style('whitegrid')
%config InlineBackend.figure_format = 'retina'
%matplotlib inline
```

#### 1. Read In SAT & ACT Data

Read in the `sat_2017.csv` and `act_2017.csv` files and assign them to appropriately named pandas dataframes.
```python
## Creating backup dataframes
sat2017_master = pd.read_csv('data/sat_2017.csv')
act2017_master = pd.read_csv('data/act_2017.csv')

## Will use these DataFrames instead
sat2017 = sat2017_master.copy()
act2017 = act2017_master.copy()
```

#### 2. Display Data

Print the first 10 rows of each dataframe to your jupyter notebook.
```python
## First 10 rows of the SAT dataframe
sat2017.head(10)
```
```python
## First 10 rows of the ACT dataframe
act2017.head(10)
```

#### 3. Verbally Describe Data

Take your time looking through the data and thoroughly describe the data in the markdown cell below.
**Answer:**

The tables above display US state-wise information on the SAT and ACT tests conducted in 2017. The data comprises the following:

1. SAT (2017) - Math and ERW section scores, each marked out of 800, a Total score marked out of 1600, and the participation rate as a percentage.
2. ACT (2017) - English, Math, Reading and Science section scores, each marked out of 36, a Composite score across all 4 sections marked out of 36, and the participation rate as a percentage for the same year.

Inconsistent datatypes - an important thing to note is that the total SAT scores are integer values while the ACT scores are floating point numbers.

A few columns and cells are named inconsistently, e.g. the last value in the ACT Composite column contains an 'x' when the column should be entirely numeric.

Lastly, the ACT data has national average scores at its first index, while the SAT data does not provide that information.
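One hedged way to surface such stray characters before cleaning is to coerce the column to numeric and look at what fails. This is a minimal sketch on made-up values; only the `Composite` column name is taken from the data.

```python
import pandas as pd

# Toy frame standing in for act2017: one stray 'x' in an otherwise numeric column.
df = pd.DataFrame({"Composite": ["21.0", "20.3", "20.2x"]})

# Coerce to numeric; anything unparseable becomes NaN, which flags the bad rows.
as_num = pd.to_numeric(df["Composite"], errors="coerce")
bad_rows = df[as_num.isna()]
print(bad_rows)
```

The same mask works on any column suspected of holding mixed types.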
**Answer:**

No missing or null values in either dataframe. We will need to convert the dtype of the participation and composite columns to int/float values.
```python
## Let's check for any null/missing values and the datatypes
sat2017.info()
```

```python
act2017.info()
```

#### 4b. Are there any obvious issues with the observations?

**What is the minimum *possible* value for each test/subtest? What is the maximum *possible* value?**

Consider comparing any questionable values to the sources of your data:
- [SAT](https://blog.collegevine.com/here-are-the-average-sat-scores-by-state/)
- [ACT](https://blog.prepscholar.com/act-scores-by-state-averages-highs-and-lows)

**Answer:**
+ SAT - the Math section has a minimum score of 52; we will consider this an outlier in our dataset.
+ ACT - the Science section has a minimum score of 2.3; we will consider this an outlier in our dataset.
```python
sat2017.min()
```

```python
act2017.min()
```

#### 4c. Fix any errors you identified

**The data is available** so there's no need to guess or calculate anything. If you didn't find any errors, continue to the next step.
**Answer:**
+ In the last row of the ACT dataframe, the value in the Composite column contains an additional character that will be an obstacle later on. Let's change that.
```python
sat2017.head(3)
```

```python
act2017.tail(3)
```

The 'Composite' column of the ACT dataframe has 1 value that might give us some issues later on; let's replace it.
```python
## .iat helps access a single value by integer position
act2017['Composite'].iat[-1]
```

```python
act2017['Composite'].iat[-1] = act2017['Composite'].iat[-1].replace('x', '')
act2017.tail(2)
```

#### 5. What are your data types?

Display the data types of each feature.
```python
sat2017.dtypes
```

```python
## Looks like the datatypes are inconsistent
## e.g. except for the 'State' column, everything should be either float or integer
act2017.dtypes
```

What did you learn?
- Do any of them seem odd?
- Which ones are not as they should be?
**Answer:**
+ Yes. The 'Participation' & 'Composite' columns might have some missing values; they should be numeric.
+ Also, it looks like all other columns can be either integers or floats, except the 'State' column, which can be of object datatype.
#### 6. Fix Incorrect Data Types

Based on what you discovered above, use appropriate methods to re-type incorrectly typed data.

- Define a function that will allow you to convert participation rates to an appropriate numeric type. Use `map` or `apply` to change these columns in each dataframe.
```python
## Custom defined function -
## converts values in selected columns to floating point numbers;
## returns NaN if the cast fails.
def tonumeric(col):
    try:
        out = float(col)
    except (TypeError, ValueError):
        out = np.nan
    return out
```

- Fix any individual values preventing other columns from being the appropriate type.
```python
## First, remove the '%' sign from the 'Participation' column
## Second, drop the original 'Participation' column
## Third, .apply our custom function
sat2017['Participation_%'] = sat2017['Participation'].apply(lambda x: x.split("%")[0])
act2017['Participation_%'] = act2017['Participation'].apply(lambda x: x.split("%")[0])
```
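As an aside, the strip-then-convert steps can also be collapsed into a single vectorized pass with `pd.to_numeric`. This is a minimal sketch on made-up values; the `'%'` format mirrors the real 'Participation' column, and `errors='coerce'` gives the same NaN-on-failure behaviour as the `tonumeric` helper.

```python
import pandas as pd

# Illustrative participation column in the same '%'-suffixed format as the data;
# the last entry stands in for a non-numeric placeholder.
part = pd.Series(["100%", "38%", "Ñ"])

# Strip any trailing '%' and coerce to numeric in one pass;
# unparseable entries become NaN.
part_pct = pd.to_numeric(part.str.rstrip("%"), errors="coerce")
print(part_pct)
```

This avoids the intermediate string column and the extra `.apply` call, at the cost of being slightly less explicit about each step.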
```python
## Drop the original columns to avoid confusion / duplicate information
sat2017.drop('Participation', axis=1, inplace=True)
act2017.drop('Participation', axis=1, inplace=True)
```

```python
sat2017['Participation_%'] = sat2017['Participation_%'].apply(tonumeric)
act2017['Participation_%'] = act2017['Participation_%'].apply(tonumeric)
```

- Finish your data modifications by making sure the columns are now typed appropriately.

```python
## Convert all other columns to float for consistency purposes
act2017['Composite'] = act2017['Composite'].apply(tonumeric)
sat2017['Evidence-Based Reading and Writing'] = sat2017['Evidence-Based Reading and Writing'].apply(tonumeric)
sat2017['Math'] = sat2017['Math'].apply(tonumeric)
sat2017['Total'] = sat2017['Total'].apply(tonumeric)
```

- Display the data types again to confirm they are correct.

```python
sat2017.dtypes
act2017.dtypes
```

#### 7. Rename Columns

Change the names of the columns to more expressive names so that you can tell the difference between the SAT columns and the ACT columns. Your solution should map all column names being changed at once (no repeated singular name-changes). **We will be combining these data with some of the data from 2018 and 2019, and so you should name columns in an appropriate way**.

**Guidelines**:
- Column names should be all lowercase (you will thank yourself when you start pushing data to SQL later in the course)
- Column names should not contain spaces (underscores will suffice--this allows for using the `df.column_name` method to access columns in addition to `df['column_name']`).
- Column names should be unique and informative (the only feature that we actually share between dataframes is the state).
```python
sat2017.columns
```

```python
## Let's start by renaming the columns of the SAT dataframe
sat2017.rename(columns={'State': 'state',
                        'Evidence-Based Reading and Writing': 'erw',
                        'Math': 'math',
                        'Total': 'total',
                        'Participation_%': 'participation_percentage'}, inplace=True)

## Add a suffix to each name
sat2017 = sat2017.add_suffix('_sat17')

## Renaming the columns of the ACT dataframe
act2017.rename(columns={'State': 'state',
                        'English': 'english',
                        'Math': 'math',
                        'Reading': 'reading',
                        'Science': 'science',
                        'Composite': 'composite',
                        'Participation_%': 'participation_percentage'}, inplace=True)
act2017 = act2017.add_suffix('_act17')
```

```python
sat2017.columns
```

```python
act2017.columns
```

#### 8. Create a data dictionary

Now that we've fixed our data, and given it appropriate names, let's create a [data dictionary](http://library.ucmerced.edu/node/10249). A data dictionary provides a quick overview of features/variables/columns, alongside data types and descriptions. The more descriptive you can be, the more useful this document is.

Example of a Fictional Data Dictionary Entry:

|Feature|Type|Dataset|Description|
|---|---|---|---|
|**county_pop**|*integer*|2010 census|The population of the county (units in thousands, where 2.5 represents 2500 people).|
|**per_poverty**|*float*|2010 census|The percent of the county over the age of 18 living below 200% of the official US poverty rate (units percent to two decimal places; 98.10 means 98.1%)|

[Here's a quick link to a short guide for formatting markdown in Jupyter notebooks](https://jupyter-notebook.readthedocs.io/en/stable/examples/Notebook/Working%20With%20Markdown%20Cells.html).

Provided is the skeleton for formatting a markdown table, with column headers that will help you create a data dictionary to quickly summarize your data, as well as some examples. **This would be a great thing to copy and paste into your custom README for this project.**
|Feature|Type|Dataset|Description|
|---|---|---|---|
|column name|int/float/object|ACT/SAT|This is an example|
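A skeleton like this can also be generated from a dataframe itself, so the feature names and dtypes never drift out of sync with the data. This is a hedged sketch on a toy frame; the column names are illustrative stand-ins for the cleaned data.

```python
import pandas as pd

# Toy frame standing in for a cleaned score table; the names follow the
# project's snake_case convention but the values are made up.
df = pd.DataFrame({"states": ["Alabama"], "math_sat17": [593.0]})

# One markdown table row per column, ready to paste into the data dictionary;
# only the Description column is left as a TODO to fill in by hand.
rows = ["|Feature|Type|Dataset|Description|", "|---|---|---|---|"]
for col, dtype in df.dtypes.items():
    rows.append(f"|{col}|{dtype}|SAT/ACT|TODO|")
skeleton = "\n".join(rows)
print(skeleton)
```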
<center><strong><span style="color:maroon"> SAT Scores (Scholastic Assessment Test) </span></strong></center>

|Feature|Type|Dataset|Description|
|---|---|---|---|
|states (PK)|object|SAT|Names of the US States, ordered <br> alphabetically in ascending order|
|erw|object|SAT|Section score for Evidence-Based<br> Reading & Writing, marked<br> between __200 & 800__|
|math|object|SAT|Section score for Math,<br> marked between __200 & 800__|
|total|object|SAT|Sum of both section scores,<br> ranges from __400 to 1600__|
|participation_<br>percentage|float|SAT|__Percentage (%)__ of students from<br>each State that took the test|
||||more on SAT scores [here][1]|

[1]: https://blog.prepscholar.com/sat-score-range "SAT Score"

<center><strong><span style="color:maroon"> ACT Scores (American College Testing) </span></strong></center>

|Feature|Type|Dataset|Description|
|---|---|---|---|
|states (PK)|object|ACT|Names of the US States, ordered <br> alphabetically in ascending order|
|english|float|ACT|Sub-score for the English section,<br> ranges from __1 to 36__|
|math|float|ACT|Sub-score for the Math section,<br> ranges from __1 to 36__|
|reading|float|ACT|Sub-score for the Reading section,<br> ranges from __1 to 36__|
|science|float|ACT|Sub-score for the Science section,<br> ranges from __1 to 36__|
|composite/<br>total|float|ACT|Mean (average) of the English,<br> Math, Reading and Science sub-scores,<br> also ranges from __1 to 36__|
|participation_<br>percentage|float|ACT|__Percentage (%)__ of students from<br>each State that took the test|
||||more on ACT scores [here][2]|

[2]: https://www.act.org/content/act/en/products-and-services/the-act/scores/understanding-your-scores.html "ACT Score"

<br><center><sub>*NOTE: A suffix such as _act17 or _sat17 in the dataframes indicates features from the 2017 ACT or SAT test.<br>(PK) = Primary Key column*</sub></center>
#### 9. Drop unnecessary rows

One of our dataframes contains an extra row. Identify and remove this from the dataframe.
**Answer:**
+ ACT has 1 additional row ('National') compared to SAT.
+ In order to merge on columns later, we will need an equal number of rows across all dataframes, so the best solution is to drop that row.
```python
act2017.head(3)
```

```python
## Drop the row at index position 0
act2017.drop(0, inplace=True)

## Reset all index values to begin at 0, consistent with the other dataframes
act2017.reset_index(inplace=True)

## Drop the old index column
act2017.drop('index', axis=1, inplace=True)
act2017.shape
```

```python
act2017.head(3)
```

#### 10. Merge Dataframes

Join the 2017 ACT and SAT dataframes using the state in each dataframe as the key. Assign this to a new variable.
```python
## Before we join both dataframes on the state column,
## let's rename it 'states' in both
sat2017.rename(columns={'state_sat17': 'states'}, inplace=True)
act2017.rename(columns={'state_act17': 'states'}, inplace=True)
```

```python
## Merge the cleaned sat_2017 & act_2017 dataframes
sat_act_2017 = pd.merge(sat2017, act2017, on='states')
```

```python
## Nothing looks suspicious with this dataframe
sat_act_2017.head(3)
```
```python
sat_act_2017.shape
```

#### 11. Save your cleaned, merged dataframe

Use a relative path to save out your data as `combined_2017.csv`.
```python
## Save using the provided file name - without the index
sat_act_2017.to_csv('data/combined_2017.csv', index=False)
```

Read in the `sat_2018.csv`, `sat_2019.csv`, `act_2018.csv` and `act_2019.csv` files and assign them to appropriately named pandas dataframes. For the **2018-2019 ACT Data**, only the `Composite` scores are available. Repeat the same processes to clean the 2018-2019 data here as you were instructed in the previous sections above.
```python
## Let's create backup dataframes
sat2018_master = pd.read_csv('data/sat_2018.csv')
act2018_master = pd.read_csv('data/act_2018.csv')

## The 2019 data needs to be opened using a different encoding
sat2019_master = pd.read_csv('data/sat_2019.csv', encoding='ISO-8859-1')
act2019_master = pd.read_csv('data/act_2019.csv', encoding='ISO-8859-1')

## Our dataframes to use
sat2018 = sat2018_master.copy()
act2018 = act2018_master.copy()
sat2019 = sat2019_master.copy()
act2019 = act2019_master.copy()
```

```python
## First, remove the '%' sign from the 'Participation' column
## Second, drop the original 'Participation' column
## Third, .apply our custom function
sat2018['Participation_%'] = sat2018['Participation'].apply(lambda x: x.split("%")[0])
act2018['Participation_%'] = act2018['Participation'].apply(lambda x: x.split("%")[0])
sat2019['Participation_%'] = sat2019['Participation Rate'].apply(lambda x: x.split("%")[0])
act2019['Participation_%'] = act2019['Participation'].apply(lambda x: x.split("%")[0])
```
```python
## Drop the original columns to avoid confusion / duplicate information
sat2018.drop('Participation', axis=1, inplace=True)
act2018.drop('Participation', axis=1, inplace=True)
sat2019.drop('Participation Rate', axis=1, inplace=True)
act2019.drop('Participation', axis=1, inplace=True)
```

**Let's clean the SAT 2019 dataframe first,**
```python
sat2019.tail(10)
```
```python
sat2019['Participation_%'].unique()
sat2019[sat2019['Participation_%'] == 'Ñ'][['State', 'Participation_%']]
act2019['Participation_%'].unique()

## Looks like the states with missing participation information are
## new additions to the list (39 - Puerto Rico, 47 - Virgin Islands).
## Without 'Participation_%', and being present only for 2019, they won't be of much use to us.
## Let's drop them, reset the index and delete the old index column
```

```python
## Drop those 2 index values
sat2019.drop([39, 47], inplace=True)

## Reset the index to be consistent with the other dataframes
sat2019.reset_index(inplace=True)
```

```python
## Drop the duplicate index column
sat2019.drop(columns=['index'], inplace=True)
sat2019.shape
```

**Clean the ACT 2019 dataframe**
```python
act2019.tail(10)
```
```python
## This dataframe has an additional "National" row at index 51.
## In reality we know that no such state exists.
## Let's drop that index
act2019.drop([51], inplace=True)
```

```python
act2019.shape
```

**Clean the SAT & ACT 2018 dataframes**
```python
act2018.iloc[15:25]
```

```python
act2018.drop(20, inplace=True)
act2018.reset_index(inplace=True)
act2018.drop('index', axis=1, inplace=True)
act2018.shape
act2018.head(10)
```

```python
## Convert all columns to be numeric (float here) and rename the features
sat2018['Participation_%'] = sat2018['Participation_%'].apply(tonumeric)
act2018['Participation_%'] = act2018['Participation_%'].apply(tonumeric)
sat2019['Participation_%'] = sat2019['Participation_%'].apply(tonumeric)
act2019['Participation_%'] = act2019['Participation_%'].apply(tonumeric)

sat2019.info()
act2019.info()
sat2018.info()
act2018.info()
```

```python
## Fix the inconsistent datatypes and column names
sat2018.rename(columns={'State': 'state',
                        'Evidence-Based Reading and Writing': 'erw',
                        'Math': 'math',
                        'Total': 'total',
                        'Participation_%': 'participation_percentage'}, inplace=True)
act2018.rename(columns={'State': 'state',
                        'Composite': 'composite',
                        'Participation_%': 'participation_percentage'}, inplace=True)
sat2019.rename(columns={'State': 'state',
                        'EBRW': 'erw',
                        'Math': 'math',
                        'Total': 'total',
                        'Participation_%': 'participation_percentage'}, inplace=True)
act2019.rename(columns={'State': 'state',
                        'Composite': 'composite',
                        'Participation_%': 'participation_percentage'}, inplace=True)
```

```python
## Add suffixes
sat2018 = sat2018.add_suffix('_sat18')
act2018 = act2018.add_suffix('_act18')
sat2019 = sat2019.add_suffix('_sat19')
act2019 = act2019.add_suffix('_act19')

## Convert the remaining SAT columns to floats
sat2018['total_sat18'] = sat2018['total_sat18'].apply(tonumeric)
sat2018['erw_sat18'] = sat2018['erw_sat18'].apply(tonumeric)
sat2018['math_sat18'] = sat2018['math_sat18'].apply(tonumeric)
sat2019['total_sat19'] = sat2019['total_sat19'].apply(tonumeric)
sat2019['erw_sat19'] = sat2019['erw_sat19'].apply(tonumeric)
sat2019['math_sat19'] = sat2019['math_sat19'].apply(tonumeric)
```

```python
sat2018.info()
act2018.info()
```

```python
sat2019.info()
```

```python
act2019.info()
```

```python
## Rename the primary key column before we merge.
## We could use left_on/right_on in the merge, but I feel this is easier.
## All become 'states'
sat2018.rename(columns={'state_sat18': 'states'}, inplace=True)
sat2019.rename(columns={'state_sat19': 'states'}, inplace=True)
act2018.rename(columns={'state_act18': 'states'}, inplace=True)
act2019.rename(columns={'state_act19': 'states'}, inplace=True)
```

**Before we commit - the state 'District Of Columbia' needs to be consistent across all dataframes**
```python
## Index 8, 'District of Columbia', is not consistently named across
## all dataframes; let's fix it in act2018, sat2018, act2019, sat2019
## and sat_act_2017 (already merged).
## Solution: titlecase all state names

## For the merged 2017 (sat_act) dataframe,
sat_act_2017['states'] = [state.title() for state in sat_act_2017['states']]
sat2018['states'] = [state.title() for state in sat2018['states']]
sat2019['states'] = [state.title() for state in sat2019['states']]
act2018['states'] = [state.title() for state in act2018['states']]
act2019['states'] = [state.title() for state in act2019['states']]
```

#### Combine your 2017 and 2018 data into a single dataframe

Joining on state names should work, assuming you formatted all your state names identically. Make sure none of your columns (other than state) have identical names. Do yourself a favor and decide if you're encoding participation rates as floats or integers and standardize this across your datasets.

Save the contents of this merged dataframe as `final.csv`.

**Use this combined dataframe for the remainder of the project**.
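Before merging, it can be worth checking that the join keys really do line up, since an inner merge silently drops any state spelled differently on one side. This is a minimal illustration on toy frames (the state values and scores are made up), using `pandas.merge`'s `validate` option rather than the project's exact code.

```python
import pandas as pd

# Toy left/right frames; a deliberate casing mismatch stands in for
# 'District Of Columbia' vs 'District of Columbia'.
left = pd.DataFrame({"states": ["Alabama", "District Of Columbia"], "sat": [1165, 977]})
right = pd.DataFrame({"states": ["Alabama", "District of Columbia"], "act": [19.2, 24.2]})

# The symmetric difference surfaces any key that would drop out of an inner merge.
mismatch = set(left["states"]) ^ set(right["states"])
print(mismatch)

# After title-casing both sides, validate='one_to_one' makes merge raise
# if a key is duplicated on either side, instead of silently multiplying rows.
left["states"] = left["states"].str.title()
right["states"] = right["states"].str.title()
merged = pd.merge(left, right, on="states", validate="one_to_one")
```

The `validate` check costs nothing when the data is clean and fails loudly when it is not.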
```python
## Merge 2018 (SAT+ACT)
sat_act_2018 = pd.merge(sat2018, act2018, on='states')
```

```python
## Merge 2019 (SAT+ACT)
sat_act_2019 = pd.merge(sat2019, act2019, on='states')

## We already have
sat_act_2017.head(0)
```

```python
sat_act_2017.shape
sat_act_2018.shape
```

```python
sat_act_2019.shape
```

```python
## Merge the 3 dataframes simultaneously on 1 key
sat_act_2017_18_19 = sat_act_2017.merge(sat_act_2018, on='states').merge(sat_act_2019, on='states')
sat_act_2017_18_19.head(3)
sat_act_2017_18_19.info()
```
```python
## Reorder all columns & rename + add prefixes as needed.
## Drop the individual section marks for the ACT scores, as we do not
## have data from the other years to compare them with.
## We can always use the sat_act_2017 dataframe for detailed 2017 ACT information
sat_act_2017_18_19.drop(columns=['english_act17', 'math_act17', 'reading_act17', 'science_act17'], inplace=True)
```

```python
sat_act_2017_18_19.columns

columns_renamed = {'states': 'states',
                   'erw_sat17': 'sat17_erw',
                   'erw_sat18': 'sat18_erw',
                   'erw_sat19': 'sat19_erw',
                   'math_sat17': 'sat17_math',
                   'math_sat18': 'sat18_math',
                   'math_sat19': 'sat19_math',
                   'total_sat17': 'sat17_total',
                   'total_sat18': 'sat18_total',
                   'total_sat19': 'sat19_total',
                   'participation_percentage_sat17': 'sat17_participation_percent',
                   'participation_percentage_sat18': 'sat18_participation_percent',
                   'participation_percentage_sat19': 'sat19_participation_percent',
                   'composite_act17': 'act17_total',
                   'composite_act18': 'act18_total',
                   'composite_act19': 'act19_total',
                   'participation_percentage_act17': 'act17_participation_percent',
                   'participation_percentage_act18': 'act18_participation_percent',
                   'participation_percentage_act19': 'act19_participation_percent'}

sat_act_2017_18_19.rename(columns=columns_renamed, inplace=True)
sat_act_2017_18_19.head()
```

```python
## Reorder the columns using .reindex()
sat_act_2017_18_19 = sat_act_2017_18_19.reindex(
    ['states',
     'sat17_erw', 'sat18_erw', 'sat19_erw',
     'sat17_math', 'sat18_math', 'sat19_math',
     'sat17_total', 'sat18_total', 'sat19_total',
     'sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent',
     'act17_total', 'act18_total', 'act19_total',
     'act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent'],
    axis=1)
```

```python
sat_act_2017_18_19.tail()
```

**Save the file - 2017-2018-2019 (SAT + ACT) merged - final.csv**
```python
## index=False so as not to save our index column
sat_act_2017_18_19.to_csv('data/final.csv', index=False)
```

## Exploratory Data Analysis

### Summary Statistics

Transpose the output of the pandas `describe` method to create a quick overview of each numeric feature.

```python
## Let's re-import our final.csv,
## first into a master dataframe
clean_sat_act_2017_18_19_master = pd.read_csv('data/final.csv')
```

```python
clean_sat_act_2017_18_19 = clean_sat_act_2017_18_19_master.copy()
clean_sat_act_2017_18_19.info()

desc_17_18_19 = clean_sat_act_2017_18_19.describe()
## Transpose using .T or .transpose()
desc_17_18_19.T
```
```python
## Let's double check a value against our initial dataframes,
## e.g. the SAT 2017 and 2018 Math standard deviations
print(sat2017_master['Math'].std(), sat2018_master['Math'].std())
```

#### Manually calculate standard deviation

$$\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n(x_i - \mu)^2}$$

- Write a function to calculate standard deviation using the formula above

```python
## Calculate the standard deviation
def checkStdDev(col):
    mean = sum(col) / len(col)
    dist = 0
    for item in col:
        dist += (item - mean) ** 2
    return round(float((dist / len(col)) ** 0.5), 3)
```

```python
## Test the function with 1 column
checkStdDev(act2017['composite_act17'])
```

```python
## Test using numpy's std function
round(np.std(act2017['composite_act17']), 3)
```

- Use a **dictionary comprehension** to apply your standard deviation function to each numeric column in the dataframe. **No loops**
- Assign the output to variable `sd` as a dictionary where:
    - Each column name is now a key
    - The standard deviation of the column is the value

*Example Output:* `{'ACT_Math': 120, 'ACT_Reading': 120, ...}`
```python
clean_sat_act_2017_18_19.shape
```

```python
## Mask to cover all columns except the categorical 'states' column
num_col = clean_sat_act_2017_18_19.iloc[:, 1:]
```

```python
num_col.shape

## Test - using my defined function
checkStdDev(num_col['act17_total'])
```

```python
## Using a dictionary comprehension
sd = {k: checkStdDev(num_col[k]) for k in num_col.keys()}
sd
```

```python
## Test - using numpy's std() method
{k: round(np.std(num_col[k]), 3) for k in num_col.keys()}
## Perfecto!
```

Do your manually calculated standard deviations match up with the output from the pandas `describe` method? What about numpy's `std` method?
**Answer:**

The manually calculated standard deviations are pretty close to the output of the pandas `describe` method, while numpy's `std` method matches perfectly. The gap has a simple cause: pandas defaults to the sample standard deviation (`ddof=1`, dividing by n-1), while numpy's `std` and the manual formula divide by n (`ddof=0`).
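The describe-vs-numpy discrepancy can be reproduced on any small series: pandas `Series.std` defaults to `ddof=1` (sample standard deviation), while `np.std` defaults to `ddof=0`, matching the manual formula. The values below are illustrative.

```python
import numpy as np
import pandas as pd

# Any numeric column behaves the same way; these values are made up.
s = pd.Series([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

print(s.std())        # ddof=1 (sample), what describe() reports
print(np.std(s))      # ddof=0 (population), matches the manual sigma formula
print(s.std(ddof=0))  # pandas reproduces numpy once ddof is aligned
```

So "close but not equal" is expected whenever the two defaults are mixed, and the effect shrinks as n grows.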
#### Investigate trends in the data
Using sorting and/or masking (along with the `.head` method to not print our entire dataframe), consider the following questions:
- Which states have the highest and lowest participation rates for the:
    - 2017 SAT?
    - 2018 SAT?
    - 2017 ACT?
    - 2018 ACT?
- Which states have the highest and lowest mean total/composite scores for the:
    - 2017 SAT?
    - 2018 SAT?
    - 2017 ACT?
    - 2018 ACT?
- Do any states with 100% participation on a given test have a rate change year-to-year?
- Do any states have >50% participation on *both* tests in either year?

Based on what you've just observed, have you identified any states that you're especially interested in? **Make a note of these and state *why* you think they're interesting**.

**You should comment on your findings at each step in a markdown cell below your code block**. Make sure you include at least one example of sorting your dataframe by a column, and one example of using boolean filtering (i.e., masking) to select a subset of the dataframe.
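A minimal sketch of the two required techniques - sorting by a column and boolean masking - on a toy dataframe (the column names and values here are illustrative, not the project's):

```python
import pandas as pd

toy = pd.DataFrame({
    'state': ['Colorado', 'Florida', 'Iowa'],
    'sat_participation': [100, 56, 3],
})

# Sorting by a column: highest participation first; .head() keeps output short
top = toy.sort_values('sat_participation', ascending=False).head()

# Boolean filtering (masking): only states above 50% participation
mask = toy['sat_participation'] > 50
over_50 = toy[mask]

print(top.iloc[0]['state'], over_50['state'].tolist())
```

The same two patterns, applied to the real participation and total-score columns, answer every question in the list above.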
```python
clean_sat_act_2017_18_19.iloc[:, 10:13].head()
```

```python
final_17_18_19_sat_act = clean_sat_act_2017_18_19.set_index('states')
```

```python
final_17_18_19_sat_act['sat17_erw'].max()  ## Maximum ERW score
```

**Max/Min Participation (%) for each year**
```python
## sat17_max_min_participation
max_min_sat17_participation_percent = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['sat17_participation_percent'] == final_17_18_19_sat_act['sat17_participation_percent'].max()) |
    (final_17_18_19_sat_act['sat17_participation_percent'] == final_17_18_19_sat_act['sat17_participation_percent'].min())
][['sat17_total', 'sat17_participation_percent']].sort_values(by='sat17_participation_percent')
max_min_sat17_participation_percent
```

```python
## sat18_max_min_participation
max_min_sat18_participation_percent = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['sat18_participation_percent'] == final_17_18_19_sat_act['sat18_participation_percent'].max()) |
    (final_17_18_19_sat_act['sat18_participation_percent'] == final_17_18_19_sat_act['sat18_participation_percent'].min())
][['sat18_total', 'sat18_participation_percent']].sort_values(by='sat18_participation_percent')
max_min_sat18_participation_percent
```

```python
## sat19_max_min_participation
max_min_sat19_participation_percent = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['sat19_participation_percent'] == final_17_18_19_sat_act['sat19_participation_percent'].max()) |
    (final_17_18_19_sat_act['sat19_participation_percent'] == final_17_18_19_sat_act['sat19_participation_percent'].min())
][['sat19_total', 'sat19_participation_percent']].sort_values(by='sat19_participation_percent')
max_min_sat19_participation_percent
```

```python
## act17_max_min_participation
max_min_act17_participation_percent = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['act17_participation_percent'] == final_17_18_19_sat_act['act17_participation_percent'].max()) |
    (final_17_18_19_sat_act['act17_participation_percent'] == final_17_18_19_sat_act['act17_participation_percent'].min())
][['act17_total', 'act17_participation_percent']].sort_values(by='act17_participation_percent')
max_min_act17_participation_percent
```

```python
## act18_max_min_participation
max_min_act18_participation_percent = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['act18_participation_percent'] == final_17_18_19_sat_act['act18_participation_percent'].max()) |
    (final_17_18_19_sat_act['act18_participation_percent'] == final_17_18_19_sat_act['act18_participation_percent'].min())
][['act18_total', 'act18_participation_percent']].sort_values(by='act18_participation_percent')
max_min_act18_participation_percent
```

```python
## act19_max_min_participation
max_min_act19_participation_percent = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['act19_participation_percent'] == final_17_18_19_sat_act['act19_participation_percent'].max()) |
    (final_17_18_19_sat_act['act19_participation_percent'] == final_17_18_19_sat_act['act19_participation_percent'].min())
][['act19_total', 'act19_participation_percent']].sort_values(by='act19_participation_percent')
max_min_act19_participation_percent
```

**Max/Min Total/Composite Scores**
```python
## sat17_max_min_total_scores
max_min_sat17_total_score = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['sat17_total'] == final_17_18_19_sat_act['sat17_total'].max()) |
    (final_17_18_19_sat_act['sat17_total'] == final_17_18_19_sat_act['sat17_total'].min())
][['sat17_total', 'sat17_participation_percent']].sort_values(by='sat17_total')
max_min_sat17_total_score
```

```python
## sat18_max_min_total_scores
max_min_sat18_total_score = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['sat18_total'] == final_17_18_19_sat_act['sat18_total'].max()) |
    (final_17_18_19_sat_act['sat18_total'] == final_17_18_19_sat_act['sat18_total'].min())
][['sat18_total', 'sat18_participation_percent']].sort_values(by='sat18_total')
max_min_sat18_total_score
```

```python
## sat19_max_min_total_scores
max_min_sat19_total_score = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['sat19_total'] == final_17_18_19_sat_act['sat19_total'].max()) |
    (final_17_18_19_sat_act['sat19_total'] == final_17_18_19_sat_act['sat19_total'].min())
][['sat19_total', 'sat19_participation_percent']].sort_values(by='sat19_total')
max_min_sat19_total_score
```

```python
## act17_max_min_total_scores
max_min_act17_total_score = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['act17_total'] == final_17_18_19_sat_act['act17_total'].max()) |
    (final_17_18_19_sat_act['act17_total'] == final_17_18_19_sat_act['act17_total'].min())
][['act17_total', 'act17_participation_percent']].sort_values(by='act17_total')
max_min_act17_total_score
```

```python
## act18_max_min_total_scores
max_min_act18_total_score = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['act18_total'] == final_17_18_19_sat_act['act18_total'].max()) |
    (final_17_18_19_sat_act['act18_total'] == final_17_18_19_sat_act['act18_total'].min())
][['act18_total', 'act18_participation_percent']].sort_values(by='act18_total')
max_min_act18_total_score
```

```python
## act19_max_min_total_scores
max_min_act19_total_score = final_17_18_19_sat_act[
    (final_17_18_19_sat_act['act19_total'] == final_17_18_19_sat_act['act19_total'].max()) |
    (final_17_18_19_sat_act['act19_total'] == final_17_18_19_sat_act['act19_total'].min())
][['act19_total', 'act19_participation_percent']].sort_values(by='act19_total')
max_min_act19_total_score
```

## First Impressions: SAT Participation Rate Is Getting Better!
```python
## sat_participation 2017-2018-2019
sat_participation_2017_18_19 = clean_sat_act_2017_18_19[
    (clean_sat_act_2017_18_19['sat17_participation_percent'] == 100) |
    (clean_sat_act_2017_18_19['sat18_participation_percent'] == 100) |
    (clean_sat_act_2017_18_19['sat19_participation_percent'] == 100)
][['states', 'sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
sat_participation_2017_18_19.sort_values('sat17_participation_percent', ascending=False)
```

```python
## Number of states with >93% SAT participation in 2017
len(clean_sat_act_2017_18_19[clean_sat_act_2017_18_19['sat17_participation_percent'] > 93])
```

```python
## Number of states with >93% SAT participation in 2018
len(clean_sat_act_2017_18_19[clean_sat_act_2017_18_19['sat18_participation_percent'] > 93])
```

```python
## Number of states with >93% SAT participation in 2019
len(clean_sat_act_2017_18_19[clean_sat_act_2017_18_19['sat19_participation_percent'] > 93])
```

```python
## Masking: 100% ACT participation per year
act_participation_100_2017 = clean_sat_act_2017_18_19['act17_participation_percent'] == 100
act_participation_100_2018 = clean_sat_act_2017_18_19['act18_participation_percent'] == 100
act_participation_100_2019 = clean_sat_act_2017_18_19['act19_participation_percent'] == 100

## act_participation 2017-2018-2019
act_participation_2017_18_19 = clean_sat_act_2017_18_19[
    act_participation_100_2017 | act_participation_100_2018 | act_participation_100_2019
][['states', 'act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent']]
act_participation_2017_18_19
```

**Do any states with 100% participation on a given test have a rate change year-to-year?**
+ Yes, year-on-year rate changes do exist for states with 100% participation.
+ For the SAT, Colorado, Idaho, Illinois and Rhode Island went from lower participation in 2017 to 100% by 2019, while the District of Columbia went down from 100% to 94% over the same period. Florida, oddly, dropped from 84% to 56% but then climbed to almost 100% between 2018 and 2019.
+ For the ACT, Colorado had a significant drop from 100% in 2017 to 27% in 2019, and Missouri and South Carolina exhibit a downward trend over the same years. On the other hand, participation grew every year in Ohio and Nebraska.
+ To sum up, many more states show full participation in the ACT than in the SAT as we go from 2017 to 2019.
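The year-to-year movements described above can be computed rather than eyeballed by differencing the participation columns. A sketch with hypothetical values, using this notebook's column-naming convention:

```python
import pandas as pd

# Illustrative numbers only - not the project's actual data
df = pd.DataFrame({
    'states': ['Colorado', 'District of Columbia', 'Florida'],
    'sat17_participation_percent': [11, 100, 83],
    'sat19_participation_percent': [100, 94, 100],
})

# Positive change = participation grew between 2017 and 2019
df['sat_change_17_19'] = (df['sat19_participation_percent']
                          - df['sat17_participation_percent'])

biggest_movers = df.sort_values('sat_change_17_19', ascending=False)
print(biggest_movers[['states', 'sat_change_17_19']])
```

Sorting on the change column surfaces the biggest gainers and losers in one step.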
```python
## Greater than 50% participation in both tests for any year
both_participation50plus_2017_18_19 = clean_sat_act_2017_18_19[
    ((clean_sat_act_2017_18_19['sat17_participation_percent'] > 50) & (clean_sat_act_2017_18_19['act17_participation_percent'] > 50)) |
    ((clean_sat_act_2017_18_19['sat18_participation_percent'] > 50) & (clean_sat_act_2017_18_19['act18_participation_percent'] > 50)) |
    ((clean_sat_act_2017_18_19['sat19_participation_percent'] > 50) & (clean_sat_act_2017_18_19['act19_participation_percent'] > 50))
][['states', 'sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent',
   'act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent']]
both_participation50plus_2017_18_19
```

**Do any states have >50% participation on both tests in either year?**
+ Yes: Florida, Georgia, Hawaii, North Carolina and South Carolina each have more than 50% participation on both tests.
+ INTERESTING FACT: from 2017 to 2019, Florida shows about a 20% increase in SAT participation and, at the same time, about a 20% drop in ACT participation.
## Visualize the data
There's not a magic bullet recommendation for the right number of plots to understand a given dataset, but visualizing your data is *always* a good idea. Not only does it allow you to quickly convey your findings (even if you have a non-technical audience), it will often reveal trends in your data that escaped you when you were looking only at numbers.

Some recommendations on plotting:
- Plots have titles
- Plots have axis labels
- Plots have appropriate tick labels
- All text is legible in a plot
- Plots demonstrate meaningful and valid relationships
- Plots are interpreted to aid understanding

There is such a thing as too many plots, and there are a *lot* of bad plots. You might make some! (But hopefully not with the guided prompts below).
#### Use Seaborn's heatmap with pandas `.corr()` to visualize correlations between all numeric features
Heatmaps are generally not appropriate for presentations, and should often be excluded from reports as they can be visually overwhelming. **However**, they can be extremely useful in identifying relationships of potential interest (as well as identifying potential collinearity before modeling).

*example*:
```python
sns.heatmap(df.corr())
```

Please take time to format your output, adding a title. Look through some of the additional arguments and options. (Axis labels aren't really necessary, as long as the title is informative).
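One way to apply those formatting suggestions - annotations, a fixed colour scale, a title, and an upper-triangle mask so each pair appears only once. The dataframe here is synthetic, not the project data:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render off-screen; drop this line in a notebook
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(0)
sat = rng.uniform(0, 100, 50)
df = pd.DataFrame({
    'sat_part': sat,
    'act_part': 100 - sat + rng.normal(0, 5, 50),  # roughly mirrors SAT participation
})

corr = df.corr()
mask = np.triu(np.ones_like(corr, dtype=bool))  # hide the redundant upper triangle
sns.heatmap(corr, mask=mask, annot=True, fmt='.2f', vmin=-1, vmax=1, cmap='rocket')
plt.title('Participation correlations (synthetic data)')
plt.tight_layout()
```

Pinning `vmin=-1, vmax=1` keeps the colour scale comparable across every heatmap in the notebook.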
```python
## Column-name lists (masks) for future use
## Heatmap-1, histograms & scatterplots later (total scores for all years, SAT & ACT)
select_total = ['sat17_total', 'sat18_total', 'sat19_total',
                'act17_total', 'act18_total', 'act19_total']

## Heatmap-2, histograms & scatterplots later (participation)
select_participation = ['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent',
                        'act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent']

## Heatmap-3, histograms & scatterplots (total + participation)
select_total_participation = select_total + select_participation
```

```python
## Check correlations and build matrices for the heatmaps
sat_total_corr = final_17_18_19_sat_act.loc[:, select_total].corr()
sat_participation_corr = final_17_18_19_sat_act.loc[:, select_participation].corr()
sat_both_total_participation_corr = final_17_18_19_sat_act.loc[:, select_total_participation].corr()

## Plot Heatmap-1 (total, SAT & ACT)
sns.set(font_scale=1.4)
plt.figure(figsize=(9, 7))
sns.heatmap(sat_total_corr, annot=True, linecolor='white', linewidths=3, cmap="rocket")
plt.xticks(rotation=20)
plt.title("SAT and ACT Total Score for 2017-2018-2019", pad=20);
```

__Heatmap-1 for total score for each year:__
+ Strong positive correlation between SAT scores across years, and even stronger correlation among ACT scores across years.
+ SAT and ACT scores show a clear, though not strong, negative correlation with each other.
```python
## Plot Heatmap-2 (participation, SAT & ACT)
sns.set(font_scale=1.4)
plt.figure(figsize=(9, 7))
sns.heatmap(sat_participation_corr, annot=True, linecolor='white', linewidths=3, cmap="rocket")
plt.title("SAT and ACT Participation (%) 2017-2018-2019", pad=20);
```

__Heatmap-2 for participation % for each year:__
+ Very strong negative correlation between SAT and ACT participation rates - this tells us that different states prefer different testing systems.
```python
## Plot Heatmap-3 (total and participation, all years, both tests)
sns.set(font_scale=1.8)
plt.figure(figsize=(21, 19))
sns.heatmap(sat_both_total_participation_corr, annot=True, linecolor='white', linewidths=3, cmap="rocket")
plt.title("SAT and ACT Total Score and Participation for 2017-2018-2019", pad=20);
```

**Heatmap-3: total + participation combined, explanation**
+ Strong negative correlation between participation and total score - this confirms that lower participation tends to go with higher totals and vice versa (likely a self-selection effect: where a test is optional, mostly stronger candidates take it).
#### Define a custom function to subplot histograms
We have data for two tests for two years. We only have composite (and not subtest) scores for the 2018 ACT. We should write a function that will take the names of 2+ columns and subplot histograms. While you can use pandas plotting or Seaborn here, matplotlib gives you greater control over all aspects of your plots.

[Helpful Link for Plotting Multiple Figures](https://matplotlib.org/users/pyplot_tutorial.html#working-with-multiple-figures-and-axes)

Here's some starter code:
```python
def subplot_histograms(dataframe, list_of_columns, list_of_titles, list_of_xlabels, list_of_ylabels):
    sns.set(font_scale=1)
    nrows = int(np.ceil(len(list_of_columns) / 3))  # Makes sure you have enough rows
    fig, ax = plt.subplots(nrows=nrows, ncols=3, figsize=(15, 10))  # You'll want to specify your figsize
    ax = ax.ravel()  # Ravel turns a matrix into a vector, which is easier to iterate
    for i, column in enumerate(list_of_columns):  # Gives us an index value to get into all our lists
        ax[i].hist(dataframe[column])  # feel free to add more settings
        ax[i].set_title(list_of_titles[i])
        ax[i].set_xlabel(list_of_xlabels[i])
        ax[i].set_ylabel(list_of_ylabels[i])
    plt.tight_layout()
```

#### Plot and interpret histograms
For each of the following:
- Participation rates for SAT & ACT
- Math scores for SAT & ACT
- Reading/verbal scores for SAT & ACT
```python
## Function parameter lists
xlabels_participation = ['', 'participation_%', ''] * 2
ylabels_participation = ['Frequency', '', ''] * 2
title_participation = ['2017_SAT_Participation%', '2018_SAT_Participation%', '2019_SAT_Participation%',
                       '2017_ACT_Participation%', '2018_ACT_Participation%', '2019_ACT_Participation%']

## Print subplots using the custom function
subplot_histograms(final_17_18_19_sat_act, select_participation, title_participation,
                   xlabels_participation, ylabels_participation)
```

```python
## Test 1 - check the histograms printed correctly
len(sat2017[sat2017['participation_percentage_sat17'] <= 20])
```

```python
## Test 2 - check the histograms printed correctly
len(act2018[(act2018['participation_percentage_act18'] > 50) & (act2018['participation_percentage_act18'] < 60)])
```

**PLOT#1 Participation rates for SAT & ACT**
1. For SAT participation, more than 50% of the states have either very high or extremely low participation.
2. The negative correlation between SAT and ACT participation can be confirmed with this plot: states with 0-10% SAT participation tend to have 90-100% ACT participation.
3. Highs and lows for the SAT and ACT clearly do not match (top 3 versus bottom 3 plots).
```python
## Mask out the low outlier before plotting
math_sat_2017_mask = sat_act_2017[sat_act_2017['math_sat17'] > 200]['math_sat17']
sns.set(font_scale=1)
plt.figure(figsize=(7, 5))
sns.distplot(math_sat_2017_mask, bins=6)  # distplot is deprecated in newer seaborn; histplot(..., kde=True) is the replacement
plt.title('Distribution for 2017 Math SAT');
```

```python
sat_act_2017['math_sat17'].mean()
```

```python
sns.set(font_scale=1)
plt.figure(figsize=(7, 5))
sns.distplot(sat_act_2017['math_act17'], bins=9)
plt.title('Distribution for 2017 Math ACT');
```

**PLOT#2 Math scores for SAT & ACT**
+ SAT - mean and median fall at almost the same point, so we can say the scores are approximately normally distributed (Mean = Median = 548).
+ ACT - similar to the SAT plot, ACT scores are also approximately normally distributed (Mean = Median = 21).
```python
sns.set(font_scale=1)
plt.figure(figsize=(7, 5))
sns.distplot(sat_act_2017['erw_sat17'], bins=5)
plt.title('SAT 2017 ERW Score');
```

```python
sns.set(font_scale=1)
plt.figure(figsize=(7, 5))
sns.distplot(sat_act_2017['reading_act17'], bins=6)
plt.title('ACT 2017 Reading Score');
```

```python
## Test - mean
sat_act_2017['reading_act17'].mean()
```

**PLOT#3 Reading/verbal scores for SAT & ACT**
+ SAT - for the ERW score, Mean > Median, so the distribution is positively skewed (and close to bimodal).
+ ACT - the scores are approximately normally distributed (Mean = Median = 22).
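The mean-versus-median rule applied above (mean above median suggesting positive skew) can be sanity-checked on synthetic samples:

```python
import numpy as np

rng = np.random.default_rng(42)
right_skewed = rng.exponential(scale=10, size=10_000)  # long right tail
symmetric = rng.normal(loc=22, scale=2, size=10_000)   # bell-shaped, like the ACT scores

# The right tail drags the mean above the median
assert right_skewed.mean() > np.median(right_skewed)

# For a symmetric distribution the two nearly coincide
assert abs(symmetric.mean() - np.median(symmetric)) < 0.15
```

The same comparison on the real score columns gives a quick, plot-free skewness check.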
#### Plot and interpret scatter plots
For each of the following:
- SAT vs. ACT math scores for 2017
- SAT vs. ACT verbal/reading scores for 2017
- SAT vs. ACT total/composite scores for 2017
- Total scores for SAT 2017 vs. 2018
- Composite scores for ACT 2017 vs. 2018

Plot the two variables against each other using matplotlib or Seaborn.

Your plots should show:
- Two clearly labeled axes
- A proper title
- Colors and symbols that are clear and unmistakable

**Feel free to write a custom function, and subplot if you'd like.** Functions save both time and space.
```python
temp_sat_act_2017 = sat_act_2017.copy()

## Creating a mask to use in plots further down the line (drops the low outlier)
temp_sat_act_2017['math_sat17'] = temp_sat_act_2017[temp_sat_act_2017['math_sat17'] > 200]['math_sat17']

interested_pairs = temp_sat_act_2017.loc[:, ['states', 'math_sat17', 'math_act17', 'erw_sat17',
                                             'reading_act17', 'total_sat17', 'composite_act17']]
```

```python
sns.set(font_scale=1.3)
## lmplot is figure-level, so it creates its own figure (a preceding plt.figure would be left empty)
sns.lmplot(x='math_sat17', y='math_act17', hue='states', data=interested_pairs,
           scatter_kws={'s': 40}, legend=False, height=7)
plt.hlines(y=np.mean(interested_pairs['math_act17']), xmin=475, xmax=625, color='r')
plt.vlines(x=np.mean(interested_pairs['math_sat17']), ymin=18, ymax=25, color='r')
plt.xlabel('SAT (score out of 800)')
plt.ylabel('ACT (score out of 36)')
plt.title('2017 - Math ACT-SAT scores by state (colours)');
## Legend purposely not used, to avoid confusion
```

```python
## 51 different colour dots, one per state
len(interested_pairs['states'])
```

**Scatterplot #1 (Math scores, SAT vs ACT, 2017)**
+ Removed the outlier in the 'math_sat17' column.
+ Slight negative correlation, but no clear relationship in the data.
+ Three groups of scorers are visible:
    1. states with a very high average ACT math score and a very low average SAT math score
    2. states with a very low average ACT math score and a very high average SAT math score
    3. states with average scores on both tests
+ This backs up our assumption that the scores are probably driven by each state's participation rate.
```python
sns.set(font_scale=1.3)
sns.lmplot(x='erw_sat17', y='reading_act17', hue='states', data=interested_pairs,
           scatter_kws={'s': 40}, legend=False, height=7)
plt.hlines(y=np.mean(interested_pairs['reading_act17']), xmin=500, xmax=650, color='r')
plt.vlines(x=np.mean(interested_pairs['erw_sat17']), ymin=18, ymax=25, color='r')
plt.xlabel('SAT (score out of 800)')
plt.ylabel('ACT (score out of 36)')
plt.title('2017 - Verbal/Reading ACT-SAT scores by state (colours)');
```

**Scatterplot #2 (Verbal/Reading scores, SAT vs ACT, 2017)**
+ Similar to the math scores: slight negative correlation, but no clear relationship in the data.
+ Three groups of scorers are visible:
    1. states with a very high average ACT reading score and a very low average SAT ERW score
    2. states with a very low average ACT reading score and a very high average SAT ERW score
    3. states with average scores on both tests
+ This backs up our assumption that the scores are probably driven by each state's participation rate.
```python
sns.set(font_scale=1.3)
sns.lmplot(x='total_sat17', y='composite_act17', hue='states', data=interested_pairs,
           scatter_kws={'s': 40}, legend=False, height=7)
plt.hlines(y=np.mean(interested_pairs['composite_act17']), xmin=950, xmax=1300, color='r')
plt.vlines(x=np.mean(interested_pairs['total_sat17']), ymin=18, ymax=25, color='r')
plt.xlabel('SAT (score out of 1600)')
plt.ylabel('ACT (score out of 36)')
plt.title('2017 - Total/Composite ACT-SAT scores by state (colours)');
```

**Scatterplot #3 (Total/Composite scores, SAT vs ACT, 2017)**
+ Similar to the math/reading scores: slight negative correlation, but no clear relationship in the data.
+ Three groups of scorers are visible:
    1. states with a very high average ACT composite score and a very low average SAT total score
    2. states with a very low average ACT composite score and a very high average SAT total score
    3. states with average scores on both tests
+ This backs up our assumption that the scores are probably driven by each state's participation rate.
```python
## Reuse the column masks from the heatmap section
total_scores = clean_sat_act_2017_18_19[select_total]
total_participation = clean_sat_act_2017_18_19[select_participation]
```

```python
### SAT Total 2017-2018-2019 correlation using scatterplots
sns.set(font_scale=1.3)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))

axes[0].scatter(total_scores['sat17_total'], total_scores['sat18_total'])
axes[0].set_title('SAT Total Scores: 2017 Vs 2018')
axes[0].set_xlabel('2017')
axes[0].set_ylabel('2018')

axes[1].scatter(total_scores['sat18_total'], total_scores['sat19_total'])
axes[1].set_title('SAT Total Scores: 2018 Vs 2019')
axes[1].set_xlabel('2018')
axes[1].set_ylabel('2019')

axes[2].scatter(total_scores['sat17_total'], total_scores['sat19_total'])
axes[2].set_title('SAT Total Scores: 2017 Vs 2019')
axes[2].set_xlabel('2017')
axes[2].set_ylabel('2019')
plt.tight_layout();
```

```python
### ACT Composite 2017-2018-2019 correlation using scatterplots
sns.set(font_scale=1.3)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))

axes[0].scatter(total_scores['act17_total'], total_scores['act18_total'])
axes[0].set_title('ACT Composite Scores: 2017 Vs 2018')
axes[0].set_xlabel('2017')
axes[0].set_ylabel('2018')

axes[1].scatter(total_scores['act18_total'], total_scores['act19_total'])
axes[1].set_title('ACT Composite Scores: 2018 Vs 2019')
axes[1].set_xlabel('2018')
axes[1].set_ylabel('2019')

axes[2].scatter(total_scores['act17_total'], total_scores['act19_total'])
axes[2].set_title('ACT Composite Scores: 2017 Vs 2019')
axes[2].set_xlabel('2017')
axes[2].set_ylabel('2019')
plt.tight_layout();
```

```python
## SAT vs ACT participation scatterplots
sns.set(font_scale=1.3)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5), dpi=600)

axes[0].scatter(total_participation['sat17_participation_percent'], total_participation['act17_participation_percent'])
axes[0].set_title('SAT vs ACT Participation: 2017')
axes[0].set_xlabel('SAT')
axes[0].set_ylabel('ACT')

axes[1].scatter(total_participation['sat18_participation_percent'], total_participation['act18_participation_percent'])
axes[1].set_title('SAT vs ACT Participation: 2018')
axes[1].set_xlabel('SAT')
axes[1].set_ylabel('ACT')

axes[2].scatter(total_participation['sat19_participation_percent'], total_participation['act19_participation_percent'])
axes[2].set_title('SAT vs ACT Participation: 2019')
axes[2].set_xlabel('SAT')
axes[2].set_ylabel('ACT')
plt.tight_layout();
```
```python
## Test - supports the plots: all year-to-year total scores are strongly correlated
total_scores.corr()
```

**Scatterplot #4** (total scores for the SAT, composite scores for the ACT, and SAT/ACT participation)
+ SAT total scores, 2017 vs 2018 vs 2019 - strong positive correlation, telling us the states are consistent in their SAT scores.
+ ACT composite scores, 2017 vs 2018 vs 2019 - strong positive correlation, telling us the states are consistent in their ACT scores.
+ SAT vs ACT participation, 2017 vs 2018 vs 2019 - clear negative correlation, telling us that states with higher SAT participation have lower ACT participation and vice versa. Not many states participate heavily in both tests; those that do include Florida, Georgia, Hawaii, North Carolina and South Carolina.
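The "strong positive correlation" read off these scatterplots can be quantified with `np.corrcoef`; a sketch with synthetic state totals built to mimic year-to-year consistency:

```python
import numpy as np

rng = np.random.default_rng(1)
total_2017 = rng.normal(1100, 90, 51)            # one average total per state
total_2018 = total_2017 + rng.normal(0, 20, 51)  # states stay roughly consistent year-to-year

# Pearson correlation coefficient between the two years
r = np.corrcoef(total_2017, total_2018)[0, 1]
print(round(r, 3))

# Consistent states imply a strong positive correlation
assert r > 0.9
```

Running `np.corrcoef` on the real year columns puts an exact number on what each panel shows.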
#### Plot and interpret boxplots
For each numeric variable in the dataframe create a boxplot using Seaborn. Boxplots demonstrate central tendency and spread in variables. In a certain sense, these are somewhat redundant with histograms, but you may be better able to identify clear outliers or differences in IQR, etc.

Multiple values can be plotted to a single boxplot as long as they are of the same relative scale (meaning they have similar min/max values).

Each boxplot should:
- Only include variables of a similar scale
- Have clear labels for each variable
- Have appropriate titles and labels
```python
## Group similar columns into one dataframe for all 3 years
erw_sat_17_18_19_scores = clean_sat_act_2017_18_19.iloc[:, 1:4]
math_sat_17_18_19_scores = clean_sat_act_2017_18_19.iloc[:, 4:7]
total_sat_17_18_19_scores = clean_sat_act_2017_18_19.iloc[:, 7:10]
part_sat_17_18_19_scores = clean_sat_act_2017_18_19.iloc[:, 10:13]
total_act_17_18_19_scores = clean_sat_act_2017_18_19.iloc[:, 13:16]
part_act_17_18_19_scores = clean_sat_act_2017_18_19.iloc[:, 16:]
```

```python
## Box Plot 1 - ERW (SAT) 2017-2018-2019 - comparative
sns.set(font_scale=1)
fig = plt.figure(figsize=(7, 4))
sns.boxplot(data=erw_sat_17_18_19_scores, orient='h')
plt.title('SAT Scores (ERW) 2017-2018-2019');
```

```python
erw_sat_17_18_19_scores.mean().mean()
```

**Box Plot #1 (SAT-ERW)**
+ Consistent scores across all 3 years; no outliers in the dataset.
+ Across the US, the average median is ~553 while the average mean score is ~564.
```python
## Box Plot 2 - Math (SAT) 2017-2018-2019 - comparative
sns.set(font_scale=1)
fig = plt.figure(figsize=(7, 4))
sns.boxplot(data=math_sat_17_18_19_scores, orient='h')
plt.title('SAT Scores (Math) 2017-2018-2019');
```

```python
math_sat_17_18_19_scores.median().mean()
```

**Box Plot #2 (SAT-Math)**
+ More than 75% of the states score well above average for this section of the SAT.
+ Across the US, the average median is ~546 while the average mean score is ~552 (right skew).
```python
## Box Plot 3 - Total Score (SAT) 2017-2018-2019 - comparative
sns.set(font_scale=1)
fig = plt.figure(figsize=(7, 4))
sns.boxplot(data=total_sat_17_18_19_scores, orient='h')
plt.title('SAT Scores (Total) 2017-2018-2019');
```

```python
total_sat_17_18_19_scores.median().mean()
```

**Box Plot #3 (SAT-Total)**
+ Quite a large interquartile range, but almost 25% of the states score well above the average.
+ Across the US, the average median is ~1100 while the average mean score is ~1120 (right skew).
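The box-plot features read off here - median, interquartile range, and outlier fences - can be recovered numerically with `np.percentile`; the totals below are made up:

```python
import numpy as np

totals = np.array([950, 1010, 1055, 1090, 1100, 1120, 1180, 1230, 1295])

q1, median, q3 = np.percentile(totals, [25, 50, 75])
iqr = q3 - q1

# The 1.5*IQR fences that matplotlib/seaborn boxplot whiskers are based on
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = totals[(totals < lower_fence) | (totals > upper_fence)]

print(median, iqr, outliers)  # 1100.0 125.0 []
```

Applying the same calculation to each score column verifies what the boxes and whiskers show.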
## Participation Rate: Box Plots tells it all!

```python
## Participation for SAT & ACT is a percentage between 0-100, so we can combine and plot them together for comparison
participation_per_comb = pd.concat([part_sat_17_18_19_scores, part_act_17_18_19_scores], axis=1)

## Box Plot 4 - Participation % (SAT & ACT) 2017-2018-2019 - comparative
sns.set(font_scale=1)
fig = plt.figure(figsize=(8, 10))
sns.boxplot(data=participation_per_comb)
plt.title("SAT & ACT Participation (%) 2017 - 2018 - 2019")
plt.xlabel('<-- SAT (2017 - 2018 - 2019) ACT -->')
plt.xticks([]);
```

```python
## Number of states with SAT participation above 70%
len(part_sat_17_18_19_scores[(part_sat_17_18_19_scores['sat17_participation_percent'] > 70) |
                             (part_sat_17_18_19_scores['sat18_participation_percent'] > 70) |
                             (part_sat_17_18_19_scores['sat19_participation_percent'] > 70)])
```

```python
## Number of states with ACT participation below 30%
len(part_act_17_18_19_scores[(part_act_17_18_19_scores['act17_participation_percent'] < 30) |
                             (part_act_17_18_19_scores['act18_participation_percent'] < 30) |
                             (part_act_17_18_19_scores['act19_participation_percent'] < 30)])
```

```python
part_act_17_18_19_scores.mean().mean()
```

**Box Plot #4 (SAT & ACT participation %)**
+ The number of states with average SAT participation > 70% is exactly the same as the number with average ACT participation < 30%.
+ SAT - US average median ~48% while average mean ~44% (left skew).
+ ACT - US average median ~63% while average mean ~62% (left skew).
+ This tells us that, on average, ACT participation is above 60% while SAT participation is below 50%.
+ Lastly, both tests show a consistently increasing participation rate.
```python
## Box Plot 5 - Composite Score (ACT) 2017-2018-2019 - Comparative
sns.set(font_scale=1)
fig = plt.figure(figsize=(7, 4))
sns.boxplot(data=total_act_17_18_19_scores, orient='h')
plt.title('ACT Scores (Total) 2017-2018-2019');
```

```python
## Distribution
sns.set(font_scale=1.3)
fig, axes = plt.subplots(nrows=1, ncols=3, figsize=(18, 5))
sns.histplot(ax=axes[0], data=total_act_17_18_19_scores.iloc[:, 0], bins=6, kde=True)
sns.histplot(ax=axes[1], data=total_act_17_18_19_scores.iloc[:, 1], bins=6, kde=True)
sns.histplot(ax=axes[2], data=total_act_17_18_19_scores.iloc[:, 2], bins=6, kde=True)
```

```python
total_act_17_18_19_scores.iloc[:, 2].median()
```

**Box Plot 5 (ACT - Total)**
+ Quite a large interquartile range, but almost 25% of the states score well above the average.
+ ACT Total: 2017 (Mean ~ Median ~ 21.5), 2018 (Mean ~ Median ~ 21.3) and 2019 (Mean ~ 21.4, Median ~ 21.1)
+ Over the 3-year period the median dropped a bit, while the mean remained steady.
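The mean-vs-median comparison used above is a quick skew check: when the mean sits above the median, the distribution leans right. A tiny sketch with an invented score sample (not the project's data):

```python
import pandas as pd

# Illustrative ACT composite scores for one year
act = pd.Series([17.8, 19.0, 20.1, 21.1, 21.2, 21.4, 23.5, 25.6])

# Mean above median -> long right tail pulls the mean up (right skew)
print(round(act.mean(), 2))  # -> 21.21
print(act.median())          # -> 21.15
```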
#### Feel free to do additional plots below *(do research and choose your own chart types & variables)*

Are there any additional trends or relationships you haven't explored? Was there something interesting you saw that you'd like to dive further into? It's likely that there are a few more plots you might want to generate to support your narrative and recommendations that you are building toward. **As always, make sure you're interpreting your plots as you go.**
```python
## We did not explore the collinearity between ERW and Math scores
final_17_18_19_sat_act.iloc[:, :6].corr()
```

```python
plt.figure(figsize=(10, 8))
sns.heatmap(final_17_18_19_sat_act.iloc[:, :6].corr(), cmap='viridis', annot=True,
            linecolor='white', linewidths=3)
```

**Interpretation**
+ Math and ERW scores show a very strong correlation for the SAT.
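`DataFrame.corr()` returns a symmetric matrix of pairwise Pearson correlations, which is what the heatmap renders. A minimal, self-contained sketch (column names and values are invented to mimic ERW/Math tracking each other):

```python
import pandas as pd

# Illustrative section scores: math moves almost in lockstep with erw
df = pd.DataFrame({
    "erw":  [500, 550, 600, 650, 700],
    "math": [510, 555, 605, 660, 695],
})

corr = df.corr()
print(round(corr.loc["erw", "math"], 3))  # -> 0.998, i.e. a very strong positive correlation
```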
#### (Optional): Using Tableau, create a choropleth map for each variable using a map of the US

Save this plot as an image file in an images directory, provide a relative path, and insert the image into the notebook in markdown.
### Top 3 States based on Participation Rates for SAT/ACT across all 3 years

Based upon your observations, choose **three** states that demonstrate interesting trends in their SAT and/or ACT participation rates. Spend some time doing outside research on state policies that might influence these rates, and summarize your findings below. **Feel free to go back and create new plots that highlight these states of interest.** If you bring in any outside tables or charts, make sure you are explicit about having borrowed them. If you quote any text, make sure that it renders as being quoted. (Make sure that you cite your sources -- check with your local instructor for citation preferences.)
```python
colorado_all = final_17_18_19_sat_act.loc[['Colorado'], :]
illinois_all = final_17_18_19_sat_act.loc[['Illinois'], :]
rhodeisland_all = final_17_18_19_sat_act.loc[['Rhode Island'], :]
hawaii_all = final_17_18_19_sat_act.loc[['Hawaii'], :]
distcolumbia_all = final_17_18_19_sat_act.loc[['District Of Columbia'], :]
```

```python
##1 Illinois - Tremendous growth in SAT participation
illinois_all[['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
```
```python
##2 Colorado - Tremendous drop in ACT participation
colorado_all[['act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent']]
```

```python
##3 Hawaii - Best all-rounder; as part of the School Board, I strongly recommend focusing on states like Hawaii
hawaii_all[['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
hawaii_all[['act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent']]
```

```python
##4 Rhode Island - Tremendous growth in SAT participation
rhodeisland_all[['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
```

```python
##5 District of Columbia - Decline in SAT participation - the School Board should focus on this state too
distcolumbia_all[['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
```

**Top states that demonstrate interesting trends:**

Top 3:
+ Illinois - Tremendous growth in SAT participation
+ Colorado - Tremendous drop in ACT participation
+ Hawaii - Best all-rounder, based on the balance between SAT and ACT participation
  * (Hawaii is a very good example of a state we should focus on: its ACT participation rate is constantly dropping, while its SAT participation rate stayed almost the same across 2017, 2018 and 2019.)

Other:
+ Rhode Island - Tremendous growth in SAT participation
+ District of Columbia - Decline in SAT participation
  * (Another good state for the School Board to focus on is the District of Columbia; losing already-gained participation is always a matter of deep concern.)
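The `df.loc[['Hawaii'], :]` idiom above uses a *list* inside `.loc`, which keeps the result a one-row DataFrame rather than a Series, so later column selections still render as tables. A toy sketch (values are invented):

```python
import pandas as pd

# Toy frame indexed by state; double brackets keep the result a DataFrame
df = pd.DataFrame(
    {"sat19_participation_percent": [100, 54, 94]},
    index=["Illinois", "Hawaii", "District Of Columbia"],
)

hawaii = df.loc[["Hawaii"], :]   # 1-row DataFrame, not a Series
print(type(hawaii).__name__)     # -> DataFrame
print(hawaii["sat19_participation_percent"].iloc[0])  # -> 54
```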
Based on your exploration of the data, what are your key takeaways and recommendations? Choose one state with a lower participation rate and provide a suggestion for how the College Board might increase participation amongst graduating seniors in this state. Are there additional data you desire that would better inform your investigations?
```python
## SAT vs ACT participation
sns.set(font_scale=1)
fig = plt.figure(figsize=(8, 10))
sns.boxplot(data=participation_per_comb)
plt.title("SAT & ACT Participation % 2017 - 2018 - 2019")
plt.xlabel('<-- SAT (2017 - 2018 - 2019) ACT -->')
plt.xticks([]);
```

```python
## SAT participation more than 70% in any year - number of states
len(part_sat_17_18_19_scores[(part_sat_17_18_19_scores['sat17_participation_percent'] > 70) |
                             (part_sat_17_18_19_scores['sat18_participation_percent'] > 70) |
                             (part_sat_17_18_19_scores['sat19_participation_percent'] > 70)])
```

```python
## ACT participation less than 30% in any year - number of states
len(part_act_17_18_19_scores[(part_act_17_18_19_scores['act17_participation_percent'] < 30) |
                             (part_act_17_18_19_scores['act18_participation_percent'] < 30) |
                             (part_act_17_18_19_scores['act19_participation_percent'] < 30)])
```

```python
part_act_17_18_19_scores.mean().mean()
```

```python
## Total Score versus Participation %
sns.set(font_scale=1)
plt.figure(figsize=(10, 8))
sns.heatmap(sat_both_total_participation_corr, linecolor='white', linewidths=3, cmap='Spectral')
plt.xticks(color='red')
plt.yticks(color='red')
plt.title("SAT and ACT Total Score for 2017-2018-2019", pad=20);
```

```python
## Bimodal distributions
subplot_histograms(final_17_18_19_sat_act, select_participation, title_participation,
                   xlabels_participation, ylabels_participation)
```

```python
## Requires attention! -- Hawaii
hawaii_all[['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
hawaii_all[['act17_participation_percent', 'act18_participation_percent', 'act19_participation_percent']]
```

```python
## Requires attention! -- District of Columbia
distcolumbia_all[['sat17_participation_percent', 'sat18_participation_percent', 'sat19_participation_percent']]
```

**Final Statement - Conclusion**
+ From the box plot above it's clear that, on average, at least 60% of states participate in the ACT while fewer than 50% of states participate in the SAT.
+ The strong negative correlation between Total Score and Participation Percentage is a matter of concern too.
+ SAT - Math and ERW scores show strong collinearity.
+ Most of the distributions, especially the various participation rates and the total scores, are bimodal in nature, i.e. not normally distributed. Factors like mandatory SAT/ACT testing split candidates into high-scoring and low-scoring groups: some have all the facilities and want to take the test, while others simply have to.
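The bimodal shape described above shows up as two separated masses in a histogram: mandatory-test states cluster near 100% participation while low-uptake states cluster near 0%. A minimal sketch with invented rates:

```python
import numpy as np

# Illustrative participation rates: one cluster near 10%, one near 100%
rates = np.array([5, 8, 10, 12, 15, 90, 95, 98, 100, 100])

counts, edges = np.histogram(rates, bins=[0, 25, 50, 75, 100])
print(counts.tolist())  # -> [5, 0, 0, 5]: mass piles up in the outer bins, the middle stays empty
```

A normal distribution would instead concentrate its mass in the central bins, which is why estimates based on a normality assumption are shaky here.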
**Recommendations**
+ Investigate which strategies and implementations went right for states with increasing participation rates. This includes states like New York, Maryland, New Jersey, Maine, West Virginia, Colorado, Florida, Idaho, Illinois and Rhode Island. Like this one: [SAT scores drop for 2019 class, but participation rises through testing in schools][1]
+ Research shows that offering **free** SAT tests is a key strategy and should be applied to all states.
+ Focusing more on states like Hawaii might help.
  * Hawaii displayed almost the same SAT participation for 3 years (55% -> 56% -> 54%), while its ACT participation rate dropped from 90% in 2017 to 80% in 2019, unlike other states where participation usually transfers from ACT to SAT and vice versa.
+ The District of Columbia is another example of a state where SAT participation dropped, from 100% in 2017 to 94% in 2019. We should definitely get some insights on what went wrong.

**Based on Research**
+ Previous changes like dropping the essay requirement, shortening the exam to 3 hours, and engaging in contracts with states and schools to make the SAT affordable are all in the right direction. We should expand these ideas to other states with low SAT scores too. The DataFrame right underneath this section backs this statement. [Reference - 2016 regulations][2]

[1]: https://www.washingtonpost.com/local/education/sat-scores-drop-for-2019-class-but-participation-rises-through-testing-in-schools/2019/09/23/332fc4d0-de11-11e9-8dc8-498eabc129a0_story.html "FREE SAT Strategy"
[2]: https://www.washingtonpost.com/local/education/sat-scores-drop-for-2019-class-but-participation-rises-through-testing-in-schools/2019/09/23/332fc4d0-de11-11e9-8dc8-498eabc129a0_story.html "2016 Regulations for SAT's"
```python
## Participation growth after new regulations were passed in 2016
sat_participation_2017_18_19.sort_values('sat17_participation_percent', ascending=False)
```
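`sort_values(..., ascending=False)` ranks states from highest to lowest participation, which is what makes the growth easy to scan. A self-contained toy example (states and values are illustrative):

```python
import pandas as pd

df = pd.DataFrame(
    {"sat17_participation_percent": [100, 3, 71]},
    index=["Michigan", "Iowa", "Florida"],
)

# Highest 2017 participation first
ranked = df.sort_values("sat17_participation_percent", ascending=False)
print(list(ranked.index))  # -> ['Michigan', 'Florida', 'Iowa']
```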
**Scope of this Project**
+ I would like to consider different factors outside of this dataset, including but not limited to racial impact: [SAT's Racial Impact][3]

[3]: https://www.insidehighered.com/news/2015/10/27/study-finds-race-growing-explanatory-factor-sat-scores-california "SAT's Racial Impact"
## Bonus: Descriptive and Inferential Statistics

#### Summarizing Distributions

Above, we used pandas `describe` to provide quick summary statistics of our numeric columns. We also demonstrated many visual relationships.

As data scientists, having a complete understanding of data is imperative prior to modeling.

While we will continue to build our analytic tools, we know that measures of *central tendency*, *spread*, and *shape/skewness* provide a quick summary of distributions.

For each variable in your data, summarize the underlying distributions (in words & statistics)
 - Be thorough in your verbal description of these distributions.
 - Be sure to back up these summaries with statistics.
Answers:
#### We generally assume that data we sample from a population will be normally distributed. Do we observe this trend?

Answer:
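One way to check this assumption empirically is a normality test such as Shapiro-Wilk (available in `scipy.stats`, which the notebook already imports). A minimal sketch with made-up numbers, not the project's columns:

```python
from scipy import stats

# Invented, bimodal-looking sample; in the notebook you would pass a
# participation or score column instead
sample = [2, 5, 8, 11, 50, 53, 56, 95, 97, 99]

# Shapiro-Wilk: small p-value -> evidence against normality
stat, p = stats.shapiro(sample)
print(0.0 <= p <= 1.0)  # -> True (p is a probability)
```

The test complements the histogram check: for bimodal participation rates like the ones seen earlier, a visibly non-bell-shaped histogram and a small p-value tell the same story.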
Does this assumption hold for:
 - Math
 - Reading
 - Rates

Explain your answers for each distribution and how you think this will affect estimates made from these data.
Answer:
#### Estimate Limits of Data

Suppose we only seek to understand the relationship between SAT and ACT participation rates in 2017.

##### Does it make sense to conduct statistical inference given these data specifically? Why or why not?

*(Think about granularity, aggregation, and the relationships between population size & rates... consider the actual populations these data describe in answering this question.)*
Answer:
##### Is it appropriate to compare *these* specific SAT and ACT math scores? Why or why not?
Answer:
#### Statistical Evaluation of Distributions

**If you feel it's appropriate**, using methods we discussed in class, run hypothesis tests to compare variables of interest in our dataset.
```python
# Code:
```
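If a hypothesis test is warranted, one candidate here is a paired t-test comparing the same states' participation rates in two different years (`scipy.stats.ttest_rel`). A hedged sketch with made-up numbers, not the project's data:

```python
from scipy import stats

# Invented paired participation rates: the same 5 states in two years
y2017 = [40, 55, 60, 70, 85]
y2018 = [45, 58, 66, 75, 90]

# Paired test: did rates shift between years for the same states?
t_stat, p_value = stats.ttest_rel(y2017, y2018)
print(t_stat < 0)  # -> True: every 2018 value is higher, so the t statistic is negative
```

A paired test is the right choice here because the two samples are not independent: each observation in one year has a matching observation (the same state) in the other.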